Variation of Word Frequencies across Genre Classification Tasks

نویسندگان

  • Yunhyong Kim
  • Seamus Ross
چکیده

This paper examines automated genre classification of text documents and its role in enabling the effective management of digital documents by digital libraries and other repositories. Genre classification, which narrows down the possible structure of a document, is a valuable step in realising the general automatic extraction of semantic metadata essential to the efficient management and use of digital objects. In the present report, we present an analysis of word frequencies in different genre classes in an effort to understand the distinction between independent classification tasks. In particular, we examine automated experiments on thirty-one genre classes to determine the relationship between the word frequency metrics and the degree of its significance in carrying out classification in varying environments.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Linguistic Profiling of Texts Across Textual Genres and Readability Levels. An Exploratory Study on Italian Fictional Prose

In this paper we present a case study focusing on the literature genre, in particular on Italian fictional prose, aimed at identifying the features characterizing this text type. Identified features were tested in two classification tasks, i.e. by genre and by readability, with promising results. Interestingly, the same multi–level set of linguistic features turned out to reliably capture varia...

متن کامل

EFL Writing Styles across Personality Traits and Gender: A Case for Iranian Academic Context

The ways individuals use words can reflect basic psychological processes, including clues to their thoughts, feelings, perceptions, and personality. This paper seeks to determine whether there is a relationship between Iranian EFL learners' writing styles and their personality and gender.  It focuses on gender and two key dimensions of personality (Neuroticism and Extroversion), which were asse...

متن کامل

Predictive Power of Involvement Load Hypothesis and Technique Feature Analysis across L2 Vocabulary Learning Tasks

Involvement Load Hypothesis (ILH) and Technique Feature Analysis (TFA) are two frameworks which operationalize depth of processing of a vocabulary learning task. However, there is dearth of research comparing the predictive power of the ILH and the TFA across second language (L2) vocabulary learning tasks. The present study, therefore, aimed to examine this issue across four vocabulary learning...

متن کامل

Perform Three Data Mining Tasks with Crowdsourcing Process

For data mining studies, because of the complexity of doing feature selection process in tasks by hand, we need to send some of labeling to the workers with crowdsourcing activities. The process of outsourcing data mining tasks to users is often handled by software systems without enough knowledge of the age or geography of the users' residence. Uncertainty about the performance of virtual user...

متن کامل

Text Genre Detection Using Common Word Frequencies

In this paper we present a method for detecting the text genre quickly and easily following an approach originally proposed in authorship attribution studies which uses as style markers the frequencies of occurrence of the most frequent words in a training corpus (Burrows, 1992). In contrast to this approach we use the frequencies of occurrence of the most frequent words of the entire written l...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2007